Description

In this assignment, students are expected to choose one pair of compounds from a suggested list, representing a start and an end compound, and to trace the metabolic changes underlying their interconversion. The main aim of the assignment is stated as follows: learning ‘how to reconstruct a metabolic pathway using comparative genomics techniques: (1) gene neighborhood analysis; (2) domain fusion analysis; (3) phyletic gene pattern’. The respective pathways should then be traced in two bacterial families: Enterobacteriaceae (including the model Gram-negative species Escherichia coli) and Bacillaceae (comprising Bacillus subtilis, a model Gram-positive bacterium). For my personal assignment I chose a pair consisting of pyruvate as the starting molecule and L-valine as the end product, because tracing the conversion in this pair seemed to be a fairly trivial task.

Task #1. View pathway in the KEGG database

The selected compounds were searched for in the KEGG Compound database (http://www.genome.jp/kegg/compound/), with the following search output:

##   Trivial name Compound ID Empirical formula
## 1     Pyruvate      C00022            C3H4O3
## 2     L-valine      C00183          C5H11NO2

With the identifiers obtained, I searched KEGG Pathway for pathways involving both compounds. Searching the general (map) database with these identifiers identified the following pathways:

##      Entry                                                     Name
## 1 map01060              Biosynthesis of plant secondary metabolites
## 2 map05230                      Central carbon metabolism in cancer
## 3 map01063 Biosynthesis of alkaloids derived from shikimate pathway
## 4 map01100                                       Metabolic pathways
## 5 map00290              Valine, leucine and isoleucine biosynthesis
## 6 map00770                        Pantothenate and CoA biosynthesis
## 7 map01110                    Biosynthesis of secondary metabolites
## 8 map01210                          2-Oxocarboxylic acid metabolism
## 9 map01230                              Biosynthesis of amino acids
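The search for pathways shared by both compounds can also be scripted. The following is a minimal sketch in R, assuming the KEGG REST `link` operation (`https://rest.kegg.jp/link/pathway/<compound>`), which returns tab-separated compound/pathway pairs. The helper names `parse_kegg_link` and `shared_pathways` are hypothetical, and the input below is a short excerpt in the assumed response format rather than a full KEGG listing, so the example runs offline.

```r
# Parse the tab-separated text of a KEGG REST "link" query
# (one "cpd:<id>\tpath:<id>" pair per line) into a vector of pathway IDs.
parse_kegg_link <- function(txt) {
  lines  <- txt[nzchar(txt)]                      # drop empty lines
  fields <- strsplit(lines, "\t", fixed = TRUE)
  ids    <- vapply(fields, function(f) f[2], character(1))
  sub("^path:", "", ids)
}

# Pathways that contain both compounds.
shared_pathways <- function(link_a, link_b) {
  intersect(parse_kegg_link(link_a), parse_kegg_link(link_b))
}

# Offline excerpt in the assumed response format (illustrative subset only).
pyruvate_links <- c("cpd:C00022\tpath:map00010",
                    "cpd:C00022\tpath:map00290",
                    "cpd:C00022\tpath:map00770")
valine_links   <- c("cpd:C00183\tpath:map00280",
                    "cpd:C00183\tpath:map00290",
                    "cpd:C00183\tpath:map00770")

shared <- shared_pathways(pyruvate_links, valine_links)
print(shared)

# Live query (requires network access; assumption: endpoint as described above):
# pyruvate_links <- readLines("https://rest.kegg.jp/link/pathway/cpd:C00022")
```

Keeping the parsing separate from the download makes the intersection step easy to check on a small, known input before running it against the live service.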

Some of these pathways are too general for the present purpose (e.g. ‘Metabolic pathways’), while others describe metabolism absent in bacteria. For further analysis I chose the ‘Valine, leucine and isoleucine biosynthesis’ pathway (ID: map00290), as it is restricted to the synthesis of L-valine and the closely related branched-chain amino acids. For the sake of brevity, the pathway is hereafter referred to as ‘valine biosynthesis’, and its branches leading to amino acids other than valine are ignored. The selected pathway was then analyzed for the presence of the respective enzyme-encoding genes in E. coli and B. subtilis.

Escherichia coli

Bacillus subtilis

These data were further summarized in the metabolic pathway flowchart. For the sake of consistency, 2-acetolactate mutase, which is absent in both species, is omitted.

Task #2. Compare pathways in PATRIC

Taxonomy identifiers for the suggested bacterial species were obtained from NCBI Taxonomy, as listed in the following tables:

Enterobacteriaceae

##                                                       Species name Taxonomy ID
## 1                        Escherichia coli str. K-12 substr. MG1655      511145
## 2 Salmonella enterica subsp. enterica serovar Typhimurium str. LT2       99287
## 3                                  Citrobacter koseri ATCC BAA-895      290338
## 4                                           Yersinia pestis KIM10+      187410
## 5                                        Edwardsiella tarda EIB202      498217
## 6                                     Erwinia amylovora ATCC 49946      716540
## 7                                         Proteus mirabilis HI4320      529507

Bacillaceae

##                                 Species name Taxonomy ID
## 1 Bacillus subtilis subsp. subtilis str. 168      224308
## 2                 Bacillus cereus ATCC 14579      226900
## 3                   Bacillus clausii KSM-K16       66692
## 4                  Bacillus halodurans C-125      272558
## 5              Bacillus licheniformis DSM 13      279010
## 6                  Bacillus pumilus SAFR-032      315750

The suggested genomes were selected as queries for mapping pathway 00290 in the PATRIC database, as suggested in the task wording. In the Enterobacteriaceae set, two species, namely Y. pestis and E. tarda, were found to be devoid of any genes involved in the pathway (data not shown); in the other species all genes were present in at least one copy, and the gene encoding acetolactate synthase (EC 2.2.1.6) was present in more than three paralogous copies in all of them. In Bacillaceae, the gene encoding valine-pyruvate transaminase is absent in all species except B. licheniformis, while the other genes are present in one (e.g. ketol-acid reductoisomerase), two (3-isopropylmalate dehydratase) or a varying number of copies.
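Copy numbers and presence/absence (phyletic) patterns of this kind can be tabulated from a flat list of annotated genes. The sketch below uses base R only; the genome labels and gene counts are hypothetical placeholders chosen for illustration and are not PATRIC output, and only the EC numbers correspond to enzymes of the pathway.

```r
# Hypothetical annotation records: one row per annotated gene.
# Genome labels and counts are illustrative, NOT results from PATRIC.
annotations <- data.frame(
  genome = c("genome_A", "genome_A", "genome_A", "genome_A",
             "genome_B", "genome_B", "genome_B"),
  ec     = c("2.2.1.6", "2.2.1.6", "1.1.1.86", "4.2.1.9",
             "2.2.1.6", "1.1.1.86", "2.6.1.66"),
  stringsAsFactors = FALSE
)

# Copy number of each EC number per genome (rows: EC, columns: genome).
copy_number <- table(annotations$ec, annotations$genome)

# Phyletic pattern: TRUE where at least one copy is present.
phyletic <- copy_number > 0

# EC numbers missing from at least one genome.
incomplete <- rownames(phyletic)[rowSums(phyletic) < ncol(phyletic)]

print(copy_number)
print(incomplete)
```

With real data, the `annotations` table would be replaced by the per-genome pathway export, and the rest of the code would stay the same.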

Task #3. Observe gene co-localization (physical linkage on chromosome)

I then examined ortholog distribution in the surveyed genomes using MicrobesOnline, as suggested in the assignment wording:

* The selected Salmonella genome was absent from the database, so a randomly picked genome of serovar Typhi was used for gene annotation.

Consistent with the previously obtained data, these results indicate that genes encoding acetolactate synthase subunits are overrepresented in bacterial genomes, presumably because of the multisubunit structure of the enzyme. They also agree that leucine dehydrogenase genes are exclusive to Bacillaceae and the alanine-synthesizing transaminase to Enterobacteriaceae. At the same time, several discrepancies were found. For instance, most of the genes indicated as absent by PATRIC were discovered during the MicrobesOnline survey. Also, a gene encoding leucine dehydrogenase is absent from the reference B. subtilis genome, though it is found in other strains of the same species as well as in several of the selected Bacillaceae members. Beyond that, MicrobesOnline provides the domain structure of the proteins encoded by the listed genes. For the sake of consistency, only genes from E. coli and B. subtilis were checked for their domain content.

E. coli

##   Gene name             Found pfams
## 1      ilvH         PF01842,PF10369
## 2      ilvN                 PF01842
## 3      ilvB PF02776,PF00205,PF02775
## 4      ilvI PF02776,PF00205,PF02775
## 5      ilvC PF07991,PF01450,PF01450
## 6      yagF                 PF00920
## 7      ilvD                 PF00920
## 8      ilvE                 PF01063
## 9      avtA                 PF00155

B. subtilis

##   Gene name             Found pfams
## 1      ilvH         PF01842,PF10369
## 2      ilvB PF02776,PF00205,PF02775
## 3      alsS PF02776,PF00205,PF02775
## 4      ilvC PF02826,PF07991,PF01450
## 5      ilvD                 PF00920
## 6      ybgE                 PF01063
## 7      ywaA                 PF01063
## 8      bcd*         PF02812,PF00208

* Since the bcd gene was absent in the reference strain, information on the respective protein structure was obtained from Bacillus subtilis subsp. subtilis str. NCIB 3610 as a proxy.
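The two domain tables can be compared programmatically. The sketch below transcribes the Pfam identifiers from the tables above into named lists and derives the families shared between the two species, the families specific to each, and the E. coli genes with identical domain architecture. The variable and helper names are illustrative only.

```r
# Pfam annotations transcribed from the two tables above.
pfam_ecoli <- list(
  ilvH = c("PF01842", "PF10369"),
  ilvN = "PF01842",
  ilvB = c("PF02776", "PF00205", "PF02775"),
  ilvI = c("PF02776", "PF00205", "PF02775"),
  ilvC = c("PF07991", "PF01450", "PF01450"),
  yagF = "PF00920",
  ilvD = "PF00920",
  ilvE = "PF01063",
  avtA = "PF00155"
)
pfam_bsub <- list(
  ilvH = c("PF01842", "PF10369"),
  ilvB = c("PF02776", "PF00205", "PF02775"),
  alsS = c("PF02776", "PF00205", "PF02775"),
  ilvC = c("PF02826", "PF07991", "PF01450"),
  ilvD = "PF00920",
  ybgE = "PF01063",
  ywaA = "PF01063",
  bcd  = c("PF02812", "PF00208")   # taken from strain NCIB 3610, see footnote
)

# Non-redundant set of Pfam families per species.
families_ecoli <- unique(unlist(pfam_ecoli, use.names = FALSE))
families_bsub  <- unique(unlist(pfam_bsub,  use.names = FALSE))

shared     <- intersect(families_ecoli, families_bsub)
only_ecoli <- setdiff(families_ecoli, families_bsub)
only_bsub  <- setdiff(families_bsub, families_ecoli)

# Group E. coli genes by domain architecture (same Pfams in the same order).
architecture <- function(x) vapply(x, paste, character(1), collapse = ",")
same_arch    <- split(names(pfam_ecoli), architecture(pfam_ecoli))

print(only_ecoli)
print(only_bsub)
print(same_arch[lengths(same_arch) > 1])
```

Grouping by the full, ordered Pfam string treats ilvB and ilvI (and yagF and ilvD) as sharing an architecture, which matches what can be read off the E. coli table by eye.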

Task #4. Observe gene co-localization (physical linkage on chromosome)

Genes identified in the previous two tasks were then analyzed for physical linkage by neighborhood. The suggested procedures were carried out to visualize adjacent genes for each entry in each genome. The results are presented in the following table. Note that neighboring genes are grouped in parentheses.

* Genes not identified in the primary search but observed during manual inspection of the neighborhood

Apparently, at least one proximity group comprising core metabolic genes is preserved in all the genomes. In some cases, interspersed genes for acetolactate synthase subunits nucleate additional gene islands, which may belong to the same regulon in a manner similar to that of the core enzyme genes.
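The notion of a proximity group can be made concrete with a simple distance rule. The sketch below is an illustration under stated assumptions, not the procedure used by MicrobesOnline: genes are sorted by start coordinate, and a new group begins whenever the distance to the previous gene exceeds a threshold. The gene names, coordinates, and the `max_gap` default of 5000 bp are all hypothetical.

```r
# Hypothetical gene start coordinates (bp) on one chromosome;
# the values are illustrative and not taken from any database.
genes <- data.frame(
  gene  = c("geneA", "geneB", "geneC", "geneD", "geneE"),
  start = c(1000, 2500, 4200, 250000, 251500),
  stringsAsFactors = FALSE
)

# Assign genes to proximity groups: a new group starts whenever the
# distance to the previous gene (by start coordinate) exceeds max_gap.
proximity_groups <- function(df, max_gap = 5000) {
  df       <- df[order(df$start), ]
  gaps     <- c(0, diff(df$start))
  df$group <- cumsum(gaps > max_gap) + 1
  df
}

grouped <- proximity_groups(genes)

# Render each group in parentheses, as neighboring genes are written above.
notation <- tapply(grouped$gene, grouped$group,
                   function(g) paste0("(", paste(g, collapse = ", "), ")"))
print(as.vector(notation))
```

A fixed distance threshold ignores strand and intervening genes, so it is only a rough stand-in for operon prediction; it is nevertheless enough to reproduce the parenthesized grouping from a list of coordinates.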

Task #5. Observe various levels of interactions between the observed genes

For the final task I chose the E. coli gene ilvD, encoding dihydroxy-acid dehydratase (EC 4.2.1.9). Interaction network reconstruction with default settings resulted in a ten-node primary shell with moderate connectivity:

Network overview

Then, each of four evidence types was used in turn as the sole source of interaction evidence:

Co-expression

Neighborhood

Gene Fusion

Co-occurrence

The co-occurrence network is clearly the largest and densest of the four evidence-specific networks, while the fusion network, which indicates reading-frame fusion events, is the smallest, with only two vertices and one edge. The most rewarding part of this analysis is the concordance between the neighborhood network and the previous proximity survey, which supports the idea that the core enzymes of valine biosynthesis are most likely to be grouped together within the genome.

Conclusion

Despite the lack of convenient APIs, the databases used here offer great opportunities for comparative genomics. In the case of valine synthesis from the pyruvate precursor, one could not only trace the metabolic pathway in different bacteria but also dissect similarities in gene content, spatial arrangement and regulation mechanisms among these species.

sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=ru_RU.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=ru_RU.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=ru_RU.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=ru_RU.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] magrittr_1.5 dplyr_0.8.5 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.3       crayon_1.3.4     digest_0.6.25    assertthat_0.2.1
##  [5] R6_2.4.1         evaluate_0.14    pillar_1.4.3     rlang_0.4.5     
##  [9] stringi_1.4.6    rmarkdown_2.1    tools_3.6.3      stringr_1.4.0   
## [13] glue_1.3.1       purrr_0.3.3      xfun_0.12        yaml_2.2.1      
## [17] compiler_3.6.3   pkgconfig_2.0.3  htmltools_0.4.0  tidyselect_1.0.0
## [21] knitr_1.28       tibble_2.1.3